4 research outputs found

    Effective Record Linkage Techniques for Complex Population Data

    Real-world data sets are generally of limited value when analysed on their own, whereas the true potential of data can be exploited only when two or more data sets are linked to analyse patterns across records. A classic example is the need for merging medical records with travel data for effective surveillance and management of pandemics such as COVID-19 by tracing points of contact of infected individuals. Therefore, Record Linkage (RL), which is the process of identifying records that refer to the same entity, is an area of data science that is of paramount importance in the quest for making informed decisions based on the plethora of information available in the modern world. Two of the primary concerns of RL are obtaining linkage results of high quality, and maximising efficiency. Furthermore, the lack of ground-truth data in the form of known matches and non-matches, and the privacy concerns involved in linking sensitive data, have hindered the application of RL in real-world projects. In traditional RL, methods such as blocking and indexing are generally applied to improve efficiency by reducing the number of record pairs that need to be compared. Once the record pairs retained from blocking are compared, classification methods are employed to separate matches from non-matches. Thus, the general RL process comprises blocking, comparison, classification, and finally evaluation to assess how well a linkage program has performed. In this thesis, we initially provide a holistic understanding of the background of RL, and then conduct an extensive literature review of the state-of-the-art techniques applied in RL to identify current research gaps.
Next, we present our initial contribution of incorporating data characteristics, such as temporal and geographic information, into unsupervised clustering, which achieves significant improvements in precision (more than 16%) at the cost of a minor reduction in recall (less than 2.5%) when applied to real-world data sets, compared to using regular unsupervised clustering. We then present a novel active learning-based method to filter record pairs subsequent to the record pair comparison step to improve the efficiency of the RL process. Furthermore, we develop a novel active learning-based classification technique for RL that makes it possible to obtain high-quality linkage results with limited ground-truth data. Even though semi-supervised learning techniques such as active learning methods have already been proposed in the context of RL, this is a relatively novel paradigm which is worthy of further exploration. We experimentally show more than 35% improvement in clustering efficiency with the application of our proposed filtering approach, and linkage quality on par with or exceeding existing active learning-based classification methods with our active learning-based classification technique. Existing RL evaluation measures such as precision and recall evaluate the classification outcome of record pairs, which can cause ambiguity when applied in the group RL context. We therefore propose a more robust RL evaluation measure which evaluates linkage quality based on how individual records have been assigned to clusters rather than considering record pairs. Next, we propose a novel graph anonymisation technique that extends the literature by introducing methods of anonymising data to be linked in a human-interpretable manner, without compromising the structure and interpretability of the data, as existing state-of-the-art anonymisation approaches do.
We experimentally show how the similarity distributions are maintained between the anonymised and the original sensitive data sets when our anonymisation technique is applied, which attests to its ability to maintain the structure of the original data. We finally conduct an empirical evaluation of our proposed techniques and show how they outperform existing RL methods.
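
The general RL process named in the abstract above (blocking, comparison, classification, evaluation) can be illustrated with a minimal Python sketch. Nothing here is taken from the thesis: the toy records, the choice of birth year as blocking key, the `difflib` string similarity, and the 0.8 threshold are all assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block(records, key):
    """Blocking: group record ids by a blocking key value."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return blocks


def candidate_pairs(blocks):
    """Yield every unordered pair of record ids that share a block."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def similarity(a, b):
    """Comparison: approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def link(records, key, threshold=0.8):
    """Classification: a pair is a match if its name similarity reaches the threshold."""
    matches = set()
    for i, j in candidate_pairs(block(records, key)):
        if similarity(records[i]["name"], records[j]["name"]) >= threshold:
            matches.add((i, j))
    return matches


def precision_recall(predicted, truth):
    """Evaluation: pairwise precision and recall against known true matches."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, not drawn from any data set in the thesis.
records = {
    1: {"name": "mary smith", "year": 1870},
    2: {"name": "mary smyth", "year": 1870},
    3: {"name": "john brown", "year": 1870},
    4: {"name": "mary smith", "year": 1885},
}
truth = {(1, 2)}

blocks = block(records, key=lambda r: r["year"])
n_candidates = len(list(candidate_pairs(blocks)))
predicted = link(records, key=lambda r: r["year"])
precision, recall = precision_recall(predicted, truth)
```

In this toy case, blocking on year leaves three candidate pairs instead of the six pairs a full comparison would need, which is the efficiency gain blocking is meant to deliver.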

    Efficient population record linkage with temporal and spatial constraints.

    Objectives: Population databases containing birth, death, and marriage certificates, or census records, are increasingly used for studies in a variety of research domains. Their large scale and complexity make linking such databases highly challenging. We present a scalable blocking and linking technique that exploits temporal and spatial constraints in personal data. Approach: Based on a state-of-the-art blocking method using locality sensitive hashing (LSH), we incorporate (a) attribute similarities, (b) temporal constraints (for example, a mother cannot give birth to two babies less than nine months apart, except in a multiple birth), and (c) spatial constraints (two births by the same mother are more likely to happen in the same location than far apart). In an iterative fashion, we identify highly confident matches first, and use these matches to further refine our constraints. We adopt a block size and frequency-based filtering approach to further enhance the efficiency of the record linkage comparison step. Results: We conducted experiments on a Scottish data set containing 17,613 birth certificates from 1861 to 1901, where the application of standard LSH blocking generated approximately 15 million candidate record pairs, with a recall of 0.999 and a precision of 0.003. With the application of our block size and frequency-based filtering approach, we obtained a ten-fold and a hundred-fold reduction of this candidate record pair set with a small reduction of recall to 0.984 and 0.962, respectively. The comparison of record pairs in the hundred-fold reduction using our iterative linking technique achieved up to 0.961 precision and 0.811 recall. This means that our method can achieve a reduction in computational effort and an improvement in precision of over 99%, at the cost of a decline in recall of below 19%. Conclusion: We presented a method to reduce the computational complexity of linking large and complex population databases while ensuring high linkage quality.
Our method can be generalised to population databases where temporal and spatial constraints can be defined. We plan to apply our method to a Scottish database with 24 million records.
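
The temporal constraint described in the abstract above (two births by the same mother are either part of a multiple birth or at least about nine months apart) could be checked along the following lines. This is only a sketch: the function name and both day thresholds are hypothetical values chosen for illustration, not parameters reported by the authors.

```python
from datetime import date

# Assumed thresholds, not taken from the paper.
MULTIPLE_BIRTH_MAX_DAYS = 2   # births this close together are treated as a multiple birth
MIN_GAP_DAYS = 270            # rough stand-in for "nine months"


def births_temporally_plausible(first: date, second: date) -> bool:
    """Return True if two births could plausibly belong to the same mother."""
    gap = abs((second - first).days)
    return gap <= MULTIPLE_BIRTH_MAX_DAYS or gap >= MIN_GAP_DAYS


same_day = births_temporally_plausible(date(1870, 1, 1), date(1870, 1, 1))
four_months = births_temporally_plausible(date(1870, 1, 1), date(1870, 5, 1))
one_year = births_temporally_plausible(date(1870, 1, 1), date(1871, 1, 1))
```

A candidate pair of birth records with the same purported mother that fails this check can be discarded before the more expensive attribute comparison.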

    Evaluation measure for group-based record linkage

    Traditionally, record linkage is concerned with linking pairs of records across data sets and the classification of such pairs into matches (assumed to refer to the same individual) and non-matches (assumed to refer to different individuals). Increasingly, however, more complex data sets are being linked where often the aim is to identify groups, or clusters, of records that refer to the same individual or to a group of related individuals. Examples include finding the records of all births to the same parents or all medical records generated by members of the same family. When ground truth data in the form of known true matches and non-matches are available, linkage quality is traditionally evaluated based on the classified versus the true matches (links) using measures such as precision (also known as the positive predictive value) and recall (also known as sensitivity or the true positive rate). The quality of clusters generated in record linkage is of high importance, since the comparison of different linkage methods is largely based on the values obtained by such evaluation measures. However, little research has been conducted thus far to evaluate the suitability of existing evaluation measures in the context of linking groups of records. As we show, evaluation measures such as precision and recall are not suitable for evaluating groups of linked records because they evaluate the quality of individually linked record pairs rather than the quality of records grouped into clusters. We highlight the shortcomings of traditional evaluation measures and then propose a novel approach to evaluate cluster quality in the context of group-based record linkage. We empirically evaluate our proposed approach using real-world data and show that it better reflects the quality of clusters generated by a group-based record linkage technique.
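
To make the pair-based view concrete, the following sketch computes the traditional pairwise precision and recall from cluster assignments. It does not implement the measure proposed in the paper; the helper names and the toy clusters are assumptions for illustration.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Expand each cluster into the set of unordered record pairs it implies."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_precision_recall(predicted_clusters, true_clusters):
    """Traditional pair-based precision and recall for a clustering."""
    predicted = pairs_from_clusters(predicted_clusters)
    truth = pairs_from_clusters(true_clusters)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: one true group of four records, with one record left out.
true_clusters = [{"a", "b", "c", "d"}]
predicted_clusters = [{"a", "b", "c"}, {"d"}]
precision, recall = pairwise_precision_recall(predicted_clusters, true_clusters)
```

Here three of the four records are grouped correctly, yet pairwise recall is only 0.5, because the single missing record removes three of the six true pairs. The number of pairs grows quadratically with cluster size, which is one way a pair-based measure can diverge from a record-level view of cluster quality.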